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Abstract 

In order to overcome the problem of item nonresponse, random imputation 
methods are often used because they tend to preserve the distribution of the 
imputed variable. Among the random imputation methods, the random hot- 
deck has the interesting property of imputing observed values. A new random 
hot-deck imputation method is proposed. The key innovation of this method is 
that the selection of donors is viewed as a sampling problem and uses calibration 
and balanced sampling. This approach makes it possible to select donors such 
that if the auxiliary variables were imputed, their estimated totals would not 
change. As a consequence, very accurate and stable totals estimations can 
be obtained. Moreover, the method is based on a nonparametric procedure. 
Donors are selected in neighborhoods of recipients. In this way, the missing 
value of a recipient is replaced with an observed value of a similar unit. This 
new approach is very flexible and can greatly improve the quality of estimations. 

Also, this method is unbiased under very different models and is thus resistant to 
model misspecification. Finally, the new method makes it possible to introduce 
edit rules while imputing. 

Keywords: balanced sampling, calibration, missing data, nearest neighbors, nonre¬ 
sponse, sampling. 


1 Introduction 

Nonresponse is an important problem in surveys. Indeed, the error caused by non¬ 
response on estimates can be more severe than the error caused by the sampling 
design. Nonresponse arises when a sampled unit does not respond to one or more 
items of a survey. One differentiates item nonresponse (a sampled unit does not 
respond to a particular question) from unit nonresponse (a sampled unit does not 
respond to the entire survey). Reweighting procedures are often used to deal with 
unit nonresponse whereas imputation methods are used to treat item nonresponse. 
Imputation denotes a procedure to replace a missing value with a substituted one. 

Imputation methods are classified as either deterministic or random. Determinis¬ 
tic refers to imputation methods that yield the same imputed values if the imputation 
is repeated. Deterministic imputation methods include ratio imputation, regression 
imputation, respondent mean imputation, and nearest neighbor imputation. Deter¬ 
ministic imputation methods produce good totals estimations. Nevertheless they 

* Address for correspondence : Caren Hasler, Institute of Statistics, University of Neuchatel, 2000 
Neuchatel, Switzerland. 

E-mail: caren. hasler@unine . ch 


1 



often fail to estimate quantiles. Random imputation refers to methods that yield 
different imputed values if the imputation is repeated. Random imputation methods 
include among others multiple imputation presented in Rubin (1987), imputation 
with added residuals considered in Chauvet et al. (2011), and random fc-nearest 
neighbor imputation (/cNNI) (see among others Dahl, 2007). Unlike deterministic 
imputation methods, random imputation methods offer the advantage of tending to 
preserve the distribution of the imputed variable. Nevertheless such methods imply 
the presence of an additional amount of variance due to the randomness of impu¬ 
tation, which is called imputation variance. Many authors have been interested in 
minimizing imputation variance. For instance, Kalton and Kish (1981, 1984) pro¬ 
posed two ways to reduce imputation variance. The first way consists of selecting 
donors among the respondents without replacement rather than with replacement. 
The second way consists of first constructing strata using the respondents’ values 
of the variable of interest and selecting proportionate stratified samples to act as 
donors. To the same end, Chen et al. (2000) proposed adjustment of the imputed 
values; Kim and Fuller (2004) and Fuller and Kim (2005) used fractional hot-deck 
imputation, and Chauvet et al. (2011) introduced a class of balanced random impu¬ 
tation methods that consists of randomly selecting residuals while satisfying given 
constraints. 

Imputation methods can alternatively be classified as either donor or predicted 
value. Donor imputation methods replace the missing value of a nonrespondent 
with the observed value of a respondent. The unit providing the value is called a 
donor and the unit receiving the value is called a recipient. A hot-deck method 
is a donor imputation method where a missing value is replaced with an observed 
value extracted from the same survey. Such a method is particularly of interest 
because it imputes feasible and observed values. The reader can, for instance, refer 
to Andridge and Little (2010) for a review of hot-deck imputation. In contrast, 
predicted value imputation methods use function of the respondents values to predict 
the missing values. 

In this paper, a new method of random hot-deck imputation is proposed: the 
balanced /c-nearest neighbor imputation method (b/cNNI). The main feature of this 
method is that the selection of donors is viewed as a sampling problem and uses 
calibration and balanced sampling. This makes it possible to select donors such 
that if the auxiliary variables were imputed, their estimated totals would not change. 
Moreover, this method is based on a nonparametric procedure. Indeed, it provides 
donors selected in neighborhoods of recipients. In this way, the gap due to one 
unit’s missing value is filled with a similar unit’s observed value. The novelty of 
the proposed method lies not only in the fact that the selection of donors uses 
balanced sampling but also in the fact that this is paired with a nonparametric 
selection of donors. Considered together in the same procedure, these two features 
imply a robustness in terms of model misspecification. Moreover, this method uses 
a methodology that makes it possible to take edit rules into account while imputing. 
The proposed method is also particularly effective, produces negligible imputation 
variance and a quasi-null bias in specified cases. 

This paper is organized as follows. In Section 2, notation and concepts of non¬ 
response are reviewed. A methodology for random hot-deck imputation methods is 
introduced in Section 3. Section 4 focusses on the presentation of the b&NNI. Sec¬ 
tion 5 introduces a formula to approximate the conditional imputation variance of 
the total when the new method is applied. Section 6 describes the models underlain 
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by the method and studies the asymptotic properties of the total estimator. Then, 
in Section 7, the performance of the new imputation method and the accuracy of the 
proposed estimator for imputation variance are tested through a simulation study. 
A short discussion concludes the paper in Section 8. 

2 Notation and concepts of nonresponse 

Consider a finite population U = {1,2,..., i ,..., N} and suppose that the target 
is the variable of interest y = (yi, 2/2 • • • 1 Vi, ■ ■ ■, Vn) T ■ In a first phase, a ran¬ 
dom sample S of size n is drawn from U with a given sampling design p (•) where 
p(s) = Pr [S = s ) for s C U. Let 7T* = Pr (i £ S ) denote the first order inclu¬ 
sion probability of unit i and let d, = 1 / 7 r* denote its Horvitz-Thompson weight 
(Horvitz and Thompson, 1952). If a census is considered, the inclusion probabilities 
and the design weights are equal to 1. A vector x* = (xn ,x* 2 , • • • ,Xiq) T of Q aux¬ 
iliary variables is assumed to be known for each unit i in the sample S. In what 
follows, it is supposed that one of the auxiliary variables is constant. In a second 
phase, a subset of respondents S r C S is obtained from S with a usually unknown 
conditional distribution q (S',.IS 1 ). The values y\. of the variable of interest are known 
for the units of S r only. Let S m = S\S r denote the complement of S r in S, i.e. the 
subsample of S containing the units with missing data (the nonrespondents). The 
respective sizes of these subsets are n r and n m with n r + n m = n. For i £ S, let r, 
be the response indicator variable 

_ J 1 if unit i belongs to S r , 

1 \ 0 otherwise. 

It is supposed that the units respond independently from each other. For each unit 
i £ S, ri is therefore generated from a Bernoulli random variable with parameter 
9i = Pr (i £ S r \i £ S). The parameter 0, represents the response propensity of unit 
i and is usually unknown. Hence, the conditional distribution q(S r \S) is a Poisson 
sampling design, i.e. 


q(S r \S)=Y[6i n (l-0i). 

i£S r i^Siri 

Three types of nonresponse mechanisms exist: uniform, ignorable, and non-ignorable. 
A uniform nonresponse mechanism is a nonresponse mechanism where each unit of 
the population has the same response propensity, i.e. 9 t = 9 for each i £ U. It is 
in this case said that the data is missing completely at random (MCAR). An ignor¬ 
able nonresponse mechanism (Rubin, 1976) is a nonresponse mechanism where the 
response propensity 9 t does not depend on the variable of interest once the auxil¬ 
iary variables have been taken into account. In the case of an ignorable nonresponse 
mechanism, the data is said to be missing at random (MAR). Finally, a non-ignorable 
nonresponse mechanism is a nonresponse mechanism where the response propensity 
9i depends on the variable of interest. The data is in this case said to be not miss¬ 
ing at random (NMAR). In a third phase, nonresponse can be corrected through 
imputation. Imputed values y*. j £ S rn are drawn with a conditional distribution 

I{vl\s,s r ). 
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The aim is to estimate the population total 

i&U 

of the variable of interest y. In the case of complete response, the estimator 

Y = E dm, 

i£S 

is a design-unbiased estimator of Y. In the presence of nonresponse, the previous 
estimator is intractable and the imputed estimator 

Yi = E diVi + E d i y *v 

i£Sr j£Sm 


is used. Moreover, consider 


X = ]>>,, 

i&U 

X = £ djXj, 

ieS 

X/ = E + E ^' x i> 

i£S r j£Sm 

where x* represents the imputed value we would have obtained for j G S' m if we 
were to impute the auxiliary variables. For instance, suppose hot-deck imputation 
is used. For each nonrespondent j G S m , a donor i G S r is chosen. The imputed 
value y* for unit j G S m is therefore the donor’s observed value yi. In this case, 
the imputed value x* is the observed value x* of the same donor as the one used to 
obtain y*. 

In the presented framework, the bias and variance of an imputed estimator 9j 
(for a total or another statistic 6) are given by 

Bias (e^ = EpE g E/ (o) - O} , 

Var {ei) = VarpE.E, (dj) + EpVar.E/ (d r ) + E p E g Vai 7 (^) , (1) 

where the subscripts p, q and I indicate respectively the expectations and vari¬ 
ances with regards to the sampling mechanism, with regards to the nonresponse 
mechanism, and with regards to the imputation mechanism. The first term in 
Expression (1) represents the sampling variance, the second term represents the 
nonresponse variance and the last term represents the imputation variance. 

3 Methodology for random hot-deck imputation meth¬ 
ods 

In this section, we propose an original formalization for random hot-deck donor im¬ 
putation. The proposed method is presented by means of this formalization, but this 
last one can be used for any random hot-deck method. Random hot-deck imputa¬ 
tion consists of replacing a missing value with an observed value extracted from the 
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same survey. For each nonrespondent, a donor is hence randomly chosen among the 
respondents. Consequently, a random hot-deck imputation can be achieved through 
the realization of a random matrix <p = ((f>ij), {i,j) G S r x S m such that 


J 1 if nonrespondent j is imputed by respondent i, 
\ 0 otherwise, 


which can be rewritten 

fiij = 1 y*=yi- ( 2 ) 

As exactly one donor must be selected for each nonrespondent, </> must satisfy 


y </>ij = 1, for each j G S m . (3) 

i(E.Sr 


It is here considered that a respondent can be used to impute several nonrespondents. 
Therefore, no conditions are imposed on 


y foil 

jeSm 

for i G S r . Taking the conditional expectation of Equation (2) generates a matrix 
of imputation probabilities xp = (xpij), (i,j) G S r x S m 

iHj = E/ (cjHj) = E 7 (l y* =yi ) = Pr (y* = Vi \S, S r ) . (4) 

By definition, -0 satisfies 

y ikj = 1 f° r eac h j ^ S m , (5) 

i&Sr 

tyij > 0 for each ( i,j ) G S r x S m . (6) 


The considered methodology for random hot-deck donor imputation is therefore 
operated in two stages. In the first stage, the matrix of imputation probabilities ■?/> 
is defined. Then, in the second stage, a realization of the matrix of imputation <f> 
is carried out. This methodology has, among other things, the interesting property 
to make it possible to implement edit rules in the imputation process to correct the 
data at the record level. Indeed, suppose that the value of the variable of interest 
yi of a respondent i G S r is, for some reason, inconsistent with a nonrespondent 
j G S m . To take this inconsistency into account, it is sufficient to set a zero in 
coefficient of the matrix of imputation probabilities ip. In this way, indeed, unit 
i G S r is removed from the set of possible donors for nonrespondent j G S m . Note 
that it might then be required to rescale the column of matrix xp corresponding to 
nonrespondent j in order to ensure that this one still sums up to 1. Two examples 
of application of this methodology are provided below. 


Example 1: simple random imputation method (with replacement) 


Simple random imputation consists of selecting, for each nonrespondent, a donor 
among the respondents. The donors are selected randomly, with replacement (a 
donor can be used several times), and with equal probabilities. This results in the 
matrix of imputation probabilities xp^ srs ^ = (b j) G S r x S m with 


t, 




1 

n r 
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Notation [sr-s] in superscript of the matrix ip means that the matrix linked to simple 
random imputation is considered. 

Example 2: random /c-nearest neighbor imputation method 

The k-nearest neighbors of a nonrespondent unit j £ S m are here defined as its k 
most similar respondents units i £ S r , i.e. 

knn(j) = {i £ S r |rank (d(i, j)) < k} , 

where d (•, •) is for instance the Mahalanobis distance defined through the noncon¬ 
stant auxiliary variables 



where XI is the variance-covariance matrix of the nonconstant auxiliary variables. If 
the values of the auxiliary variables are known on the sample level only, 51 must be 
estimated. 

The random fc-nearest neighbor imputation method (/cNNI) consists of replacing 
the missing value of a unit j £ S m with the value of one of its /c-nearest neighbors. 
The donors are randomly selected with equal probabilities. This results in the matrix 
of imputation probabilities ip^ = ^, (i, j) £ S r x S m with 

Ak\ _ \ \ if i belongs to the k nearest neighbors of j. , , 

d ( 0 otherwise. 

Notation [/c] in superscript of the matrix ip means that the matrix linked to /cNNI is 
considered. The matrix of imputation probabilities related to the /cNNI is a matrix 
containing exactly k non-null coefficients in each column and all these non-null 
coefficients are equal to 1/k. This particular matrix of imputation probabilities is 
the starting point of the method proposed in this paper. 

4 Balanced k -nearest neighbor imputation method 

4.1 Aim of the method 

Random imputation methods show the nice feature that they tend to preserve the 
distribution of the variable being imputed. This is often not the case for deterministic 
imputation methods. In return, this randomness implies the undesirable presence of 
imputation variance. Hence, a random imputation method that makes it possible to 
keep the imputation variance relatively small is of great interest. Moreover, donor 
imputation methods, such as hot-deck, have the advantage of imputing feasible and 
observed values. 

An imputation method can be based on either a parametric procedure or a 
nonparametric procedure. The hypotheses underlain by parametric procedures are 
strong. In this case, one can assume that the distribution of the variable of interest is 
known, but the parameters of this distribution must be estimated. Practically, this 
kind of hypotheses is satisfied in only a few cases. On the other hand, hypotheses 
underlain by nonparametric procedures are much weaker. Hence, such procedures 
are usually much more flexible than parametric procedures. An imputation method 
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based on a nonparametric procedure therefore makes it possible to handle a wider 
range of data without violating its underlain hypotheses than an imputation method 
based on a parametric procedure. 

The proposed method shows the nice features stated above. First, it is a random 
hot-deck imputation method. It therefore tends to preserve the distribution of the 
variable being imputed and it imputes observed and feasible values. Moreover, even 
though it is random, the proposed method makes it possible to control the imputa¬ 
tion variance of the total estimator. Indeed, the imputation process conserves the 
estimator of the total of the auxiliary variables. If the auxiliary variables suffered 
from nonresponse, their imputed total estimator would match their total estimator 
under complete response. The imputation mechanism relative to bfcNNI is there¬ 
fore such that conditionally on the sampling mechanism and on the nonresponse 
mechanism 

Xj = X. (8) 

If a linear model fits well the relation between the variable of interest and the aux¬ 
iliary variables, the imputation variance of the total estimator can in this way be 
reduced. As a result, the proposed method makes it possible to impute randomly 
while keeping imputation variance of the total estimator relatively small. Then, the 
proposed method is based on a nonparametric procedure. Indeed, the donors are 
chosen in neighborhoods of recipients. For each nonrespondent, a donor is randomly 
selected among its ^-nearest neighbors. As nonparametric, this procedure is very 
adaptive and makes it possible to impute values that are close to the unobserved 
missing value for a wide range of data. Moreover and as stated above, the impu¬ 
tation process conserves the estimator of the total of the auxiliary variables. If a 
linear model fits well the relation between the variable of interest and the auxiliary 
variables, this property implies that the imputed total estimator for the variable 
being imputed is close to the total estimator that would have been obtained under 
complete response (Horvitz-Thompson estimator). As this last one is unbiased, the 
obtained imputed total estimator is nearly unbiased in the case specified above. De¬ 
tails regarding the properties of the total estimator are given in Section 6. Last, the 
proposed method used the methodology proposed in Section 3. This makes it possi¬ 
ble for the user to take into account edit rules directly while imputing as explained 
in Section 3. 

In what follows, it is explained how donors can be obtained. Notation [bk\ in 
the superscript of the matrices 0 and 0 means that the matrices linked to the 
b/cNNI are considered. The method proceeds in two steps. The first step consists 
of obtaining the matrix of imputation probabilities 0^ fc l whereas the second step 
consists of generating a realization of the matrix of imputation 0^. The aim is 
that the imputation mechanism satisfies Equation (8). With this aim, the matrix of 
imputation probabilities 0^ is, in the first step, constructed such that Equation 

E/ (X/) = X (9) 

is satisfied and the matrix of imputation probabilities 0^ j S; j n the seC ond step, 
generated such that Equation 

X/ = E/ (X/) . (10) 

is satisfied. These two steps are presented in Section 4.3 and in Section 4.6 respec¬ 
tively. Equation (9) together with Equation (10) lead to Equation (8). As a result, 
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the imputation mechanism obtained through the two steps described above satisfies 
Equation (8). 


4.2 Calibration 


The aim of this Section is to briefly describe calibration (Deville and Sarndal, 1992) 
which is the main tool used in Section 4.3 to obtain the matrix of imputation prob¬ 
abilities 0^. Suppose a vector of Q auxiliary variables x* = (xn,x&, ■ ■ ■ ,Xiq) T is 
known for each unit of the population U. The aim of calibration is to find calibration 
weights Wi for i € S as close as possible to the initial design weights dj, = l/i r* (in 
an average sense for a given distance) while respecting the calibration equation 

y wiXi = y Xj = x. 

ies ieu 


Several distance functions are proposed in Deville and Sarndal (1992) as a means of 
measuring the distance between the initial design weights di = l/i r* and the final 
weights Wi . Each distance provides a particular form for the final weights W{. The 
raking method is the calibration obtained considering the distance function G(-, •) 
given by 


G(wi,di) 



Wi + di. 


This leads to the final weights w % = di exp (A T x;) where the vector A = (Ai, A 2 ,..., Aq) T 
is the solution to the calibration equation 

y di exp (a t X j) Xj = y Xj. 

ieS ieu 


From Deville and Sarndal (1992), if a solution A to this calibration problem exists, 
then it is unique. 


4.3 Obtaining the matrix of imputation probabilities '0^ 

In this Section, a procedure to obtain the matrix of imputation probabilities 0^ 
is presented. This matrix must satisfy 

^[bk] q on iy ^ ^ ^ knn(j), 

because it is imposed that donors are chosen in neighborhoods of recipients. More¬ 
over, as stated in Section 3, a matrix of imputation must satisfy equations (5) and 
(6). Finally, as stated in Section 4.1, in order to select donors such that Equation (8) 
is satisfied, it is imposed on 0^ to satisfy Equation (9), i.e 

E/ (Xj) = X. (11) 

This constraint states that conditionally on the sampling mechanism and the nonre¬ 
sponse mechanism, the imputation mechanism provides an unbiased imputed total 
estimator for the auxiliary variables. Equation (11) is equivalent to 

d .i y x < = y 

jeSm i&Sr j£S m 
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Hence, the matrix of imputation probabilities ip^ = must satisfy simulta¬ 

neously 

^[bk] g on jy j£ ^ g knn(j), 

y 4 k] = 1 for eaci1 j g s m, 

ieSr 

^[bk] > g £ or each (i,j) g 5V- x S m , 

y ^ y Wf’xi = y ^ Xj . 

j€.Sm i€.S r j€.Sm 

The existence of a matrix of imputation probabilities ip^' 1 ^ satisfying the above 
conditions and the choice of the number of nearest neighbors k are discussed in 
Section 4.4. Algorithm 1 presents a procedure to obtain the matrix of imputation 
probabilities ip^. The main idea of Algorithm 1 is to find a matrix of imputa¬ 
tion probabilities ip^ close to the matrix of imputation probabilities relative to the 
fcNNI ip^ k \ while satisfying equations (12), (13), (14), and (15). This Algorithm 
initializes with the matrix ip^ k \ Throughout all the steps, a null coefficient remains 
null which implies that equation (12) is satisfied. Thereafter, calibrations and nor¬ 
malizations are alternated. Calibrations provide matrices ip{21) for t > 1 satisfying 
equations (14) and (15). However, these matrices ip{21) are not matrices of imputa¬ 
tion probabilities because they do not satisfy (13). Normalizations provide matrices 
ip(21 + 1) for i > 1 satisfying equations (13) and (14) but not necessarily satisfying 
equation (15). Iterations stop when the matrix ip{2£ + 1) obtained by normaliza¬ 
tion approximately satisfies equation (15). The matrix ip^ bk 1 is the last ip{2£ + 1) 
considered. Hence, ip ^ satisfies equations (12), (13), (14), and (15) simultaneously. 


( 12 ) 

(13) 

(14) 

(15) 


4.4 Choice of k and existence of ip 


In this Section, the choice of the number of nearest neighbors k and existence of a 
matrix of imputation probabilities ip having the required properties are discussed. 

The number of nearest neighbors k is relevant when constructing the matrix 
ip\ bk } in Section 4.3. Indeed, this matrix must contain at most k ■ n m non-null 
coefficients as confirmed by Equation (12). Moreover, the coefficients of this matrix 
must satisfy Equation (13) and Equation (15). These two equations together form 
a system of n m + q constraints. Hence, when constructing the matrix of imputation 
probabilities ip^ bk \ the aim is to find k ■ n m unknown coefficients that satisfy n m + q 
linear constraints. As a result, a necessary condition to find a matrix ip^ satisfying 
Equations (12), (13) and (15) is 


k > 


n m + q 


(16) 


However, this condition is not sufficient to ensure that a solution ip^ exists and 
elements external to the choice of k have an effect. Such an element is the con¬ 
figuration of the nonrespondents. Indeed, Equation (14) and Equation (13) imply 
together that all the coefficients ip must lie between 0 and 1. Consider the ex¬ 
treme case in which all the units with the largest values of one auxiliary variable are 
nonrespondents. For this auxiliary variable, and as all the coefficients ipf^ must lie 
between 0 and 1, the small values of respondents do not make it possible to compen¬ 
sate the large missing values of the nonrespondents. Hence, a matrix ip^ with the 
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Algorithm 1 Procedure to obtain the matrix of imputation probabilities 


• Initialization 

Set ^(l) = the matrix of imputation probabilities relative to the fcNNI defined 
in Expression (7). 

• Iterations 

Repeat for I = 1,2,... 

— Calibration 

For i £ S r , consider the initial weights di = Y^jeSm djtp(2£ — 1)^ and obtain 
the calibrated weights Wi = di exp ^A T x^ by means of the raking method. The 
calibration equation is 

X! ,r ' X ' = X d ’ X r 

ieSr jeSm 


— Let ip(21) be the matrix defined as ip(2£)ij = ip(2£ — 1)^ exp ^A T x, ; 

— Normalization 

Let ip(21 + 1) be the matrix defined as %p(2£ + 1)^ = ^ . ■ 

— Stop criterion 

If 

SieSr + 1 )ij x iq ~ Yhj^S m d i X N , 

max — -—----- < i 

1 <q<Q Y^ijeSm 


for a small fixed error tolerance tol , then 

* Set ip [bk] = ip(2l + 1), 

* Stop. 





properties stated above might not exist, whatever the value of k is. Condition (16) 
is therefore necessary but not sufficient for a solution to exist. Note that, if a so¬ 
lution exists, then it is unique. Indeed, the solution to the calibration problem in 
Algorithm 1 is unique (see last sentence of Section 4.2). 

The choice of k can have an impact on the bias of the total estimator. Indeed, 
the smaller k is, the closer the values of the auxiliary variables are in a neighborhood 
of k units and therefore the closer the values of the variable of interest y tend to 
be in the same neighborhood. As a result and from Property 3 below, under the 
hypothesis that the values of variables y are similar in a neighborhood, the smaller 
k is, the smaller the bias of the total estimator tends to be. Hence, with the aim 
of controlling the bias, k should be chosen as small as possible. However, note that 
this choice has no impact on the bias of the total estimator when the hypotheses 
of Property 1 or those of Property 2 are satisfied. Indeed, the bias of the total 
estimator are in these cases null whatever the value of k is. Moreover, the larger k 
is, the larger the imputation variance of the total is. 

For these reasons, we suggest to choose the smallest value k for which a solution 
for the computation of matrix ^ bk ^ exists. Thus, we propose to fix the value of k as 
follows. First, choose a value of k that satisfies Equation (16) and that is relatively 
small. Then, apply Algorithm 1 and see if this one finds a solution, i.e. returns a 
matrix if }^ bk 1 satisfying the required conditions. If this is the case, the user can then 
apply Algorithm 2 in order to select the donors. Otherwise, we propose gradually 
increasing the value of k and repeating this procedure until a solution is found. 


4.5 Stratified balanced sampling 


The aim of this Section is to briefly describe stratified balanced sampling. This 
represents the main tool used in Section 4.6 to obtain the matrix of imputation 
4>^ bk \ Suppose a vector of Q auxiliary variables x, = (xn,Xi 2 , ■ ■ . ,Xjg) T is known 
for each unit of the population U. A sampling design p(-) is said to be balanced on 
these auxiliary variables if 


E 


Xj 

TP 


i&U 


for each s C U with p(s) > 0. This last equation can be rewritten 

y —Hies=y x,, 

ieu 1 ieu 


where lj £s is the indicator function which takes value 1 if unit i belongs to s and 
0 otherwise. The cube method (Deville and Tille, 2004) is a method for balanced 
sampling. It allows a balanced sample to be selected while satisfying the inclusion 
probabilities in the sense that E (lies) = 7Tj for each i £ U. Suppose moreover that 
the population U is partitioned into H nonoverlapping strata U \, I/ 2 ,... ,Ufj. A 
stratified balanced sampling design is a sampling design balanced in each stratum, 
i.e. 


y l*es — Xj for each 1 < h < H, 
ieu h ' Kl ieu h 


(17) 


for each s C U with p(s) > 0. Chauvet (2009) and Hasler and Tille (2014) proposed 
methods for stratified balanced sampling which are based on the cube method. The 
algorithm proposed in Hasler and Tille (2014) is particularly fast and is applicable 
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when the number of strata is very large. The samples selected with these methods 
are approximately balanced in each stratum and approximately balanced in the 
overall population while satisfying the inclusion probabilities in the sense that 

E (lies) = 7 Ti for each i € U. (18) 

Remark 1. If the sum of the inclusion probabilities 

rih = y Tti 
ieu h 

is integer in each stratum Uh, 1 < h < H, the algorithm proposed in Hasler and Tille 
(2014) selects exactly nh units in each stratum Uh, 1 < h < H. 

4.6 Selection of the donors 

In this Section, a procedure to obtain the matrix of imputation <fi ^ from the matrix 
of imputation probabilities j s presented. The main idea of this procedure is 
to select a donor among the respondents for each nonrespondent while satisfying 
constraints. The key feature of the method is that the selection of donors is viewed 
as a sampling problem and the constraints are satisfied through stratified balanced 
sampling, where only one unit is selected in each stratum. 

As stated in Section 4.1, in order to select donors such that Equation (8) is 
satisfied, it is imposed on <f>^ to satisfy Equation (10), i.e 

Xj = Ej (X/) . (19) 

This constraint states that the imputation variance of the auxiliary variables cancels 
out. A sufficient condition for Equation (19) to hold is 

y X; = y dji.'lj' x, for each j € S m , 

i€.Sr i€:Sr 

which can be rewritten 

7 / 

y J ik) = y (i J r 'j x; f ° r each 3 e 5 m. 

iGSr Vij i&Sr 

Moreover, matrices and must be linked by Equation (4) , i.e. 

= E/ for each (bi) e S r x S m . 

The aim is therefore to generate matrix suc h that 

, I [bk] 

y ~ fbk] l ^ ] = y <i J i ui x; f ° r each 3 e Sm > ( 2 °) 

ieSr y ieSr 

E/ (') = for each (i,j) € S r x S m . (21) 

A matrix <f>^ satisfying exactly the above equations often does not exist. However, 
when generating this matrix with the procedure presented below, the constraints 
are relaxed until a solution is found. See Remark 3. A solution to this problem was 
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proposed in Chauvet et al. (2011). A slight modification of that is presented here. 
Consider the population of cells 

U = {(i,j)\i £ S r ,j £ S m } . 

This population is partitioned into n m strata Uj , j £ S m where Uj = {(i,j)\i £ S r }. 
Each stratum corresponds to one nonrespondent. Then, exactly one unit will be 
selected in each stratum, providing in this way exactly one donor for each nonre¬ 
spondent. For each unit ( i,j ) £ U, consider the initial inclusion probability 

> 

and the auxiliary variables 

, , [bk\ 

= djTpij Xi. 

\bk] 

Moreover, as r/A is 1 if respondent i is the donor for nonrespondent j and 0 other¬ 
wise, consider 

1 (i,j)es - <t>ij ■ 

It means that respondent i £ S r is used to impute the missing value of nonrespondent 
j £ S m if unit (i,j) is selected in the sample. The problem formed by equations (20) 
and (21) can be rewritten as follows 

^~ 1 (hi)eS = Y for each 3 e ( 22 ) 

(ij)euj 

E i i l (i,j)es) = 7T( i,j ) for each (i,j) £ U. (23) 

This is a typical problem of stratified balanced sampling where only one unit is 
selected in each stratum because each respondent receives exactly one value. Equa¬ 
tions (22) and (23) correspond respectively to equations (17) and (18). The pro¬ 
cedure to obtain matrix c/j^ therefore uses stratified balanced sampling and is 
presented in Algorithm 2. The first step of the algorithm consists of selecting a 
stratified balanced sample in the cell population U = {(i,j)\i £ S r ,j £ S m } such 
that equations (22) and (23) are satisfied. In a second and last step, matrix cj)^ is 
obtained by setting (pf^ = 1 if the cell (i,j) has been selected in the sample and 0 
otherwise. 

Remark 2. For each j £ S m , the following equation is satisfied 

Y = Y ^ If 1 = L 

(■ i,j)eUj 

This means that the sum of the inclusion probabilities is equal to 1 in each stratum 
considered in the stratified balanced sampling problem of Algorithm 2. Therefore, as 
the procedure for balanced stratified sampling proposed in Hasler and Tille (20If) is 
applied in Algorithm 2 and from Remark 1, exactly 1 unit is selected in each stratum, 
i.e 


Y 1 (biles' = 1 f or each J £ s m- 

(bi)e£/j 
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Moreover, as 


X! 1 (i,i)eS = ]C ^ f or each i G Sm ’ 

(■ i,j)eUj *es r 

the matrix (f)^ satisfies Equation (3). This means that Algorithm 2 provides exactly 
one donor for each nonrespondent j £ S m . 

Remark 3. It is often not possible to select samples such that the balancing equations 
are exactly satisfied. As a result, Algorithm 2 often generates a matrix 4>' bk ^ such 
that Equation (22) is only approximately satisfied. Equivalently, donors are often 
selected such that Equation (19) is only approximately satisfied. 


Algorithm 2 Procedure to obtain the matrix of imputation <f^ bk ^ 

• Stratified balanced sampling 

Select a stratified balanced sample S in the cells population U = {{i,j)\i £ S r ,j £ S'™} 
with the method proposed in Hasler and Tille (2014) such that 

^ -r^-l(i,j)es = X] xpj) for each j £ S m , 

(ij)eUj (iJ)eUj 

E i (l (ij)es) = K(ij) for each (i,j) £ U, 

where 

— X(ij) = djip^Xi is the vector of balancing variables linked to unit (i,j) £ 

S r X Sm, 

— is the inclusion probability attached to unit {i,j) £ S r x S m , 

— Uj = {(i, j)\i £ SV} is the stratum attached to unit j £ S m . 

• Matrix of imputation 

Let <fi bk ^ be the matrix defined as 


5 Approximation of conditional imputation variance 

A procedure to approximate the imputation variance conditional to the design and 
on the nonresponse mechanism of the total Var i(Yj) is described in this Section. For 
the sake of brevity, we refer to the latter variance as conditional imputation variance, 
because it is conditional to the sampling design and the nonresponse mechanism. 
The proposed procedure relies on the idea that, in the new method, the selection 
of donors is viewed as a sampling problem and imputation is achieved through 
stratified balanced sampling. However, all the values of the variable of interest are 
known prior to selecting the sample of imputed values. Indeed, donors are selected 
among respondents and the values of the variable of interest are fully observed for 
these. Deville and Tille (2005) proposed an approximation formula for the variance 
of a total under balanced sampling that can be used in this framework. Based on 
this approximation formula, we propose the following formula for the conditional 
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imputation variance. 


Varf p (Y r ) = Y Cijtf (yi ~ b T x^ , (24) 

i£Sr jes m 


. _ A bk \ ( 1 _ n ™ k 

Cv-Vtj ^ ) nmk _ q ’ 


( 


b = 


\ 


-i 


Y Y r 'jdj x ' x ; 


i£.Sr j€Sm 

v <C#o 


Y Y c ij (, ] x >yi- 

i€S r j&S m 


Notice that only a single respondents set is necessary to approximate the con¬ 
ditional imputation variance. However, it can underestimate this one. Indeed, this 
formula comes from the variance of the total for balanced sampling. When ap¬ 
plying balanced sampling, it is often not possible to exactly satisfy the balancing 
constraints, which is referred to as a rounding problem. A sample can thus be only 
approximately balanced (see Deville and Tille, 2004). Indeed, the balancing con¬ 
straints must be relaxed in order to make it possible to select a sample. Hence, 
the variance of the total when balanced sampling is applied can be broken down 
into two terms: a first term derived under the hypothesis that the balancing equa¬ 
tions are perfectly satisfied and a second term due to the rounding problem (see 
Deville and Tille, 2005). As the other approximations and estimators of the vari¬ 
ance proposed in this framework, Formula (24) does not capture the part of the 
variance due to the rounding problem and can therefore underestimate the condi¬ 
tional imputation variance. 

The stronger the linear relation between the variable of interest and the aux¬ 
iliary variables is, the more formula (24) tends to underestimates the conditional 
imputation variance. To understand the reason for this, suppose that there is a 
strict linear relation between the variable of interest and the auxiliary variables. In 
this case, if the auxiliary variables are perfectly balanced, so is also the variable of 
interest. As a result, the term in the variance due to the balancing itself is null. 
Hence, the variance is only due to the rounding problem. In this case, an estima¬ 
tor that captures only the term due to the balancing therefore captures 0% of the 
actual variance. Then, as the linear relation between the variable of interest and 
the auxiliary variables weakens, the variable of interest becomes less well balanced. 
This implies that the variance due to the balancing increases, and, therefore, that 
the part of the actual variance that is returned by an estimator that captures only 
the variance due to this one increases. 


6 Properties of the imputed total estimator 

The performance of the imputed total estimator under the proposed imputation 
method relies on two underlying models: the linear model and the response model. 
Moreover, as donors are chosen in neighborhoods of recipients, the performance of 
the imputed total estimator depends on a third principle, which is the neighborhood 
principle. These models and principles are here detailed. Moreover, the asymptotic 
properties of the imputed total estimator are studied. 
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6.1 Linear model 

Property 1. 1. Suppose the data is MAR (or MCAR). Consider the linear model 

m: yt = f3 T Xi + e* with E m (e*) = 0, 

where E m (.) denotes the expectation with respect to the model m. If the model 
m holds, then the bkNNI provides an unbiased imputed total estimator Yj in 
the sense that 

Bias (P/) = E m EpE 9 E/ (Yj - Y^ = 0. 

2. Moreover, if the relation between the variable of interest and the auxiliary 
variables is strictly linear, i.e. 

Vi = /3 T x ?: , 

then the bkNNI provides an imputed total estimator Yj with a quasi-null im¬ 
putation variance, i.e. 

Yax im p = EpEgVar/ (Y/) « 0, 

regardless of the nonresponse process (MCAR, MAR or NMAR). 

The proof is given in the Appendix. It results that if the linear model m reason¬ 
ably fits the population data, then the b/cNNI provides an almost unbiased imputed 
total estimator Yj with a small imputation variance. 


6.2 Response model 

Property 2. Let -ip [bk] = (V’jf 1 ) € S r x S m , be the matrix of imputation 

probabilities relative to the bkNNI. If 


1 + Ylj 


jCSr, 


f±A bk 1' 

di Yij 


perfectly fits the true response probability of each uniti € S r , then the bkNNI provides 
an unbiased imputed total estimator Yj, i.e. 


Bias (Y/) = E p E g E/ (? r - Y^j = 0. 

The proof is given in the Appendix. It results that if the model stated above 
estimates the response probabilities for i G S r reasonably well, then the imputed 
estimator Yj is an almost unbiased estimator for Y. 

This result can be interpreted in the following way. Respondent i G S r acts as 
a donor a certain number of times in such a way that in expectation its weight is 
equal to 

di+Yl d ^i f 1 ' 

jeSm 

If di is the response probability of unit i € S r , then the bias due to nonresponse 
is compensated. This can happen when the auxiliary variables x,; can describe the 
nonresponse mechanism, because the weights are obtained by calibration on 
the estimated totals of these variables. 
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6.3 Neighborhood principle 

If none of the two previous models hold, a third principle can correct the situa¬ 
tion, namely the neighborhood principle. The neighborhood principle states that 
neighboring units (i.e units showing close auxiliary values) show close y values. 

Property 3. Consider (■ i,j ) £S r x S rn . If the implication 

i € knn(j) => y % - y 3 = 0 

holds, then the bkNNI provides an unbiased imputed total estimator Yj, i.e. 

Bias (9^) = E p E g E/ (Yj - y) = 0. 

The proof is given in the Appendix. It results that if neighboring units (i.e units 
showing close auxiliary values) have y values which are close, then the imputed 
estimator Yj is an almost unbiased estimator for Y. 

6.4 Resistance to model misspecification 

The new method is resistant to model misspecification in terms of bias of the im¬ 
puted total estimator Yj. It is indeed sufficient that only one of the three models 
or principle stated above holds to obtain an unbiased imputed total estimator Yj. 
However, a unique model provides an imputed total estimator Yj with a quasi-null 
imputation variance, namely the strictly linear model yi = /3 T x*. 

6.5 Asymptotic properties of the total estimator 

We now study the asymptotic properties of the total estimator under bfcNNI. Sup¬ 
pose that there is a sequence of finite populations indexed by I such that the pop¬ 
ulation size Ni and the sample size nt tend to +oo as l —> +oo. Thereafter, index 
l is omitted in order to make the notation less cluttered but the asymptotic results 
and convergences are understood to be as l —>• +oo. The following assumptions are 
considered: 

(Al): irij - ninj = O (Jfj) for each i,j 6 [/, i / j. 

(A2): di = O (^) for each i G U. 

(A3): The data is MAR. 

(A4): The following model holds: 

m: yi = (3 T Xi + £i, 

with E m {£i) = 0, E = a 2 < +oo if i = j and 0 otherwise and where 
E m (-) denotes the expectation with respect to the model m. 

(A5): The imputation design is exactly balanced, i.e. 

£ 

i£.Sr i£.Sr 


for each j G S m . 
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(A6): The approximation of the conditional imputation variance is exact, i.e. 

Var/(Y/) = Var“ pp (l}) = ^ ^ (Hjd 2 (yi - b T x^ , 

i€.S r j€.Sm 

where c t j and b are defined below Formula (24). 

(A7): # > 0 |i € S r | = O (-^j for each j,l G S'™ such that j / A 

This hypothesis states that the way the donors distribute from one nonre¬ 
spondent to another are not very dependent. This constraint is incompatible 
with the fact that the same respondents are always used as donors. 

Proposition. Suppose assumptions (Al) to (A7) hold. Then 

%-Y 

N 

converges in probability to 0. 

The proof is given in the Appendix. 


7 Simulation study 

A brief simulation study is conducted to test the performance of the new imputation 
method and to test the accuracy of the proposed estimator for imputation variance. 

7.1 The data 

The MU284 population from Sarndal et al. (1992) was considered here. This data 
set is available in the R sampling package (Tille and Matei, 2007). The following 
variables were considered (the initial names of the variables are written in brackets): 

• y: revenues from 1985 municipal taxation, in millions of kronor (RMT85), 

• x 1 : 1985 population, in thousands (P85), 

• x 2 : 1975 population, in thousands (P75), 

• x 3 : number of Conservative seats in municipal council (CS82). 

The correlations between y and the variables x 1 , x 2 and x 3 are respectively 0.96, 
0.97 and 0.52. The population size is N = 284. Two cases were considered, namely 

Case 1: The three auxiliary variables (x 1 , x 2 , and x 3 ) were considered, 

Case 2: Only the auxiliary variable that is the less correlated to y (namely x 3 ) was 
considered. 

The model m defined in Section 6.1 induces a R 2 which is approximately 0.94 and 
0.27 in Case 1 and in Case 2 respectively. 
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7.2 Simulation settings 

A census was considered, which means that 7 Tj = d t = 1 for each unit i of the 
population U = {1,2,..., TV}. The sample therefore matches the population, i.e. 
S = U. One hundred respondents sets were created by generating 100 response 
indicator vectors R. Each component ri,i £ U of R was generated from a Bernoulli 
distribution with parameter 


1 + exp(l -fan)' 

where /3 is a positive coefficient used to reach the mean response rate 70% (MAR), 
Xu is the value of the variable for unit i GU, and £ = 1,3 in Case 1 and in Case 
2 respectively. 

For each respondents set, 100 imputations were conducted with each of the fol¬ 
lowing methods: 

• NNI: nearest-neighbor, 

• PMM: predictive mean matching proposed by Little (1988), 

• SRS: random hot-deck, donors randomly selected with replacement in the re¬ 
spondents set, 

• SRSWOR: same as SRS except that donors are selected without replacement 
as proposed in Kalton and Kish (1981, 1984), 

• LNNI: fc-nearest neighbor, 

• bfcNNI: proposed method, balanced fc-nearest neighbor, 

with k = 20. For each imputation, the total, the 10th percentile, the 90th percentile, 
and the variance of the imputed variable of interest were estimated. Note that, for 
bLNNI, the matrix of imputation probabilities was replaced by the matrix 

of imputation probabilities defined in Expression (7) for the simulations in 
which Algorithm 1 failed to find a solution (see Section 4.4). Moreover, for each 
simulation, the imputation variance of the total obtained with the proposed method 
was estimated using Expression (24). 


7.3 Measures of comparison 

In order to measure the bias of the imputed estimator 0/ for a parameter 9, the 
Monte Carlo relative bias RB was considered. It is defined as 


where 


RB(0 / )=^-^, 


M r Mi 

a* _ 1 2 r,i 

~ T ~TTT 2 ; / <1 ’ 


Mr Mj 


r =1 i— 1 


Mr = 100 is the number of respondents sets generated, Mj = 100 is the number of 
imputations conducted for each respondents set, and Of 1 is the estimate obtained for 
the 2-th imputation of the r-th respondents set generated. The quantity 0j therefore 
represents the mean of the estimated value of the parameter 9 over the MrMj 
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simulations. The variability of the imputed estimator 0/ was measured through the 
Monte Carlo relative root mean square error (RRMSE) defined as 


where 


rrmse [eA = 


MSE 0/ 


0 


MSE ( 0 ^ 


1 1 

M r Mi 


Mr Mi 

E e («;■' 

r =1 i =1 



Finally, the Monte Carlo relative root imputation variance (RRIV), or relative im¬ 
putation standard deviation, of the imputed estimator 0/ was computed in order to 
measure the amount of variance due to imputation. R is defined as 


where 


RRIV 0/ = 


IV 0/ 


0 


IV (0/) 


1 

Mr 


E 


r —1 


1 

Mi - 1 




and 


0 


r 

I 


1 

Wi 


Mj 

EV 

1=1 


represents the mean estimated value of 0 for the r-th respondents set. 

In order to test the accuracy of the variance formula of Expression (24), the 
average over the simulations of the approximated conditional imputation variance 
was computed, namely 


1 

Mr 


Mr 

E Var f(Y,Y, 

r =1 


where Varj PP (l/) r is the imputation variance obtained with Expression (24) for the 
r-th respondents set generated. That one was then compared to the Monte Carlo 
imputation variance of the total IV (V/) defined above. 


7.4 Results of the simulations 

Table 1 and Table 2 show measures of comparison for the six imputation methods 
considered in Case 1 and Case 2 respectively. Table 3 displays the average over the 
simulations of the estimated imputation variance of the total as well as the Monte 
Carlo imputation variance of the total. 

The results confirm that the proposed method (b/cNNI) performs particularly 
well when there is a strong linear relation between the variable of interest and the 
auxiliary variable, as in Case 1 ( R 2 ~ 0.94). Indeed, results of Table 1 show that, 
in Case 1, bfcNNI outperforms the other donor imputation methods considered. It 
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provides the smallest RB and the smallest RRMSE for each parameter of interest 
considered. 

Moreover, Table 1 and Table 2 show that the neighborhood principle and the 
balancing principle have an effect on the imputation variance of the total. This 
effect depends on the strength of the relation between the variable of interest and 
the auxiliary variables. Indeed, in Case 1 (strong linear relation) RRIV of the total 
is 0.094 for SRS, which reduces to 0.008 for /cNNI (neighborhood principle) and to 
0.002 for bfcNNI (neighborhood principle and balancing principle) whereas, in Case 
2 (weak linear relation) these figures are 0.093, 0.019, and 0.016. 

The results also show that selecting the donors without replacement (SRSWOR) 
among respondents induces a smaller imputation variance than selecting them with 
replacement (SRS), which is in agreement with Kalton and Kish (1981, 1984). 

Finally, the results confirm that the performance of the proposed method relies 
on the strength of the linear relation that governs the data. Indeed, in Case 2 
(Table 2) this linear relation is much weaker than in Case 1 and the proposed 
method shows diminished performance compared to that observed in Case 1. Note 
that, in Case 2, the proposed method nevertheless still performs better overall than 
the other methods considered. 

The results in Table 3 confirm that Formula (24) can underestimate the imputa¬ 
tion variance. The magnitude of this underestimation goes along with the strength 
of the linear relation between the variable of interest and the auxiliary variables. 
Indeed, in Case 2 (weak linear relation), the average approximated conditional im¬ 
putation variance represents more than the 90% of the Monte Carlo imputation 
variance of the total. This quantity drops to approximately 60% in Case 1 (strong 
linear relation). 

8 Conclusion 

In this paper, a new method of random hot-deck imputation, called balanced k- 
nearest neighbor, has been proposed. This method has the interesting property of 
being a donor imputation. It therefore produces observed and feasible values. The 
novelty of this method is that the selection of donors is viewed as a sampling problem 
and uses calibration and balanced sampling. Also, selection of donors is achieved in 
a nonparametric manner as donors are selected in neighborhoods of recipients. As 
this method is random, it can be used for total estimation as well as for quantiles 
and variance estimation. 

This method offers the nice advantage that it produces a total estimator with 
negligible imputation variance and a quasi-null bias in specified cases. Indeed, the 
method involves three underlying models or principles. They provide conditions for 
the imputed total estimator to be an unbiased estimator and for the imputation 
variance of that estimator to cancel. The method is resistant to model misspecifica- 
tion in terms of bias but a unique model results in a quasi-null imputation variance 
of the total. 

A formula to approximate the conditional imputation variance of the total has 
been suggested. The procedure used is inspired by that applied to estimate the 
variance of the total for balanced sampling. The proposed approximation tends to 
underestimate the conditional imputation variance of the total. 

Finally, a simulation study has been conducted to test the performance of the 
proposed method and that of the approximation formula of conditional imputation 
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Table 1: Monte Carlo relative bias (RB), Monte Carlo relative root mean square 
error (RRMSE), and Monte Carlo relative root imputation variance (RRIV) for the 
total estimation, the 10-th percentile estimation, the 90-th percentile estimation, 
and the variance estimation of the variable of interest y in Case 1. 




Monte Carlo estimates 

Parameter of interest 

Method 

RB 

RRMSE 

RRIV 

Total 

NNI 

0.008 

0.010 

0.000 


PMM 

0.015 

0.017 

0.000 


SRS 

0.281 

0.297 

0.094 


SRSWOR 

0.278 

0.288 

0.070 


fcNNI 

0.030 

0.032 

0.008 


bfcNNI 

-0.001 

0.003 

0.002 

10-th percentile 

NNI 

0.067 

0.093 

0.011 


PMM 

0.039 

0.093 

0.010 


SRS 

0.187 

0.120 

0.040 


SRSWOR 

0.186 

0.198 

0.031 


fcNNI 

0.098 

0.124 

0.046 


bfcNNI 

0.006 

0.083 

0.053 

90-th percentile 

NNI 

0.002 

0.009 

0.000 


PMM 

0.005 

0.013 

0.000 


SRS 

0.246 

0.256 

0.068 


SRSWOR 

0.248 

0.255 

0.054 


fcNNI 

0.009 

0.018 

0.015 


bfcNNI 

0.000 

0.006 

0.005 

Variance 

NNI 

-0.001 

0.002 

0.000 


PMM 

-0.002 

0.002 

0.000 


SRS 

0.389 

0.533 

0.361 


SRSWOR 

0.379 

0.466 

0.267 


fcNNI 

-0.004 

0.004 

0.002 


bfcNNI 

0.000 

0.001 

0.000 


variance. It has been confirmed that the new method performs well when a strong 
linear relation governs the data and that this performance decreases as this linear 
relation weakens. Lastly, it was confirmed that the formula for imputation variance 
of the total tends to underestimate the conditional imputation variance of this one. 
Note that the estimation of the variance due to the rounding problem is still an un¬ 
resolved problem. This variance can also be approximated by multiple imputations. 
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Table 2: Monte Carlo relative bias (RB), Monte Carlo relative root mean square 
error (RRMSE), and Monte Carlo relative root imputation variance (RRIV) for the 
total estimation, the 10-th percentile estimation, the 90-th percentile estimation, 
and the variance estimation of the variable of interest y in Case 2. 




Monte Carlo estimates 

Parameter of interest 

Method 

RB 

RRMSE 

RRIV 

Total 

NNI 

-0.001 

0.030 

0.017 


PMM 

0.000 

0.030 

0.017 


SRS 

0.207 

0.230 

0.093 


SRSWOR 

0.207 

0.222 

0.069 


fcNNI 

0.004 

0.030 

0.019 


bfcNNI 

-0.001 

0.028 

0.016 

10-th percentile 

NNI 

-0.013 

0.080 

0.047 


PMM 

-0.012 

0.080 

0.047 


SRS 

0.092 

0.108 

0.036 


SRSWOR 

0.092 

0.105 

0.027 


fcNNI 

0.023 

0.076 

0.046 


bfcNNI 

0.005 

0.074 

0.045 

90-th percentile 

NNI 

0.004 

0.051 

0.034 


PMM 

0.005 

0.052 

0.034 


SRS 

0.193 

0.211 

0.072 


SRSWOR 

0.193 

0.207 

0.056 


fcNNI 

0.004 

0.053 

0.036 


bfcNNI 

-0.001 

0.052 

0.034 

Variance 

NNI 

-0.001 

0.094 

0.054 


PMM 

-0.001 

0.094 

0.052 


SRS 

0.372 

0.525 

0.356 


SRSWOR 

0.376 

0.473 

0.268 


fcNNI 

-0.003 

0.088 

0.061 


bfcNNI 

-0.008 

0.076 

0.044 


Table 3: Average over the simulations of the approximated conditional imputation 
variance of Expression (24), Monte Carlo imputation variance of the total and ratio 
of these two quantities in two different cases. 



Case 


1 

2 

Average approx, imputation variance 

8172.74 

1244589.00 

Monte Carlo imputation variance 

13146.21 

1327932.00 

Ratio 

0.62 

0.94 
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Proofs of the properties 

Proof of Property 1. 


E m Ej (?! - Y J = E m £ 4 ^ 

\i£Sr jeSm ieSr i€S 


y di(3 T xi + y dj'y - y cm 3 t x , 


j(E.Sm i€.S r 


= y [E/ (XjJ - XJ = 0, 

where the last equality comes from Equation (11). As it is supposed that the 
data is MAR, the expectation with respect to m, the one with respect to p, 
and the one with respect to q can be reversed. It therefore produces 

Bias (>>) = E m E p E g E/ (f 7 - f) = E m E p E g E 7 (? r — Y + Y — f) 

= E m E p EgE/ (Yj - Y^j = EpE q E m E/ (Yj - Y^j = 0. 

2. If yi = /3 T Xj, 

Yj - Ej (f 7 ) = (3 t % - E/ (/3 T X 7 ) = (3 T [x 7 - E 7 (x 7 )] « 0, 
where the last approximation comes from Remark 3 (page 14). Therefore 


Var 7 (f 7 ) = Ej F 7 - E/ (f 7 ) 2 » 0, 


Var,; mp = EpE^Vaiq ( Y r ) « 0 


Proof of Property 2. 


Ei (^) = y ^ + y ^y = y^ 1 + y 


X^Sq- j qrfi Sr 


d ij\ bk \ 


y d iT.Vi- 


If 0i is the true response probability, this last expression represents the propensity 
score adjusted estimator. Therefore 


E p E 9 E/ (f 7 ) = E P E ? (y dijyi\ = F, 


and consequently 


Bias IFF = EpE„E 7 F 7 - F = 0. 
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Proof of Property 3. 
We have 


E ' - p ) = E <i > E 4“'y< - E 

j€.Sm idzSr j€.Sm 

= E d > E - E 4 E 

j£Sm i£Sr j£Sm i£S r 

= J2 f\yi-yj) 

j£.Sm i^Sr 
= 0 , 

where the last equality comes from the fact that is nonzero only if i € knn(j) 
and from the hypothesis of the property. Therefore, it produces 

Bias (V>) = E p E q Ej (? T - e) = E p E q Ej (Yj -Y + Y- e) 

= E p E q Ej (?r - E) = 0. 


□ 


Proof of the Proposition 


The proof of the Proposition requires the four following lemmas. 

Lemma 1. Suppose that assumptions (Al) and (A2) hold. Then - converges in 
probability to 0. 

Proof. As E is a design unbiased estimator of E, we have 


E„ 


E-E 

N 


= 0. 


Moreover, we have 


Var,, 


E-E 

N 


N' 2 


E E Wi1T 

ieu jeu 1 J 


-ViVj 


s wEE 


= O 


+ h E vA 

ieu jeu 1 J ieu 1 

iAi 

1 

n 


where the last equality follows from assumptions (Al) and (A2). By Bienayme- 
Chebychev inequality, we conclude that Y ^ Y converges in probability to 0. □ 

Lemma 2. Suppose that assumptions (A2) and (A6) hold. Then 


Varj 


Yj-Y 

N 
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Proof. Assumption (A6) implies 


Var / 


Yi-Y 

N 


= ^Var r(U 


= SiE E c « rf ; (ifi - b T x 

i€.S r jtzSm 

E ^ 


j€S m iGSr 


[bfc] n m k 2 / _ b T N 2 

n m k-q j \ Vl ' 




= O 


where the last inequality comes from assumption (A2), from y^ — b T Xj = 0(1), and 
from = O (jr). □ 

Lemma 3. Suppose that assumptions (A2), (Af), and (A 7) hold. Then 

Yj-Y 


Var m E/ 


N 


= O 


Proof. Using E/ 



and assumption (A4), we get 


E/ 


Yi-Y 

N 


N I E d iE^ 


[bfc] 

ij 


Vi 


U'6S„ 


ieSr 



1 

N 


E d t E^lf ] ^ Tx * +£ *) 

jes m i£Sr 


E rfi(/3 T Xj+£j) 

j€Sm 
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Therefore, from assumption (A4) again, we obtain 


Var m E i 


Yj - Y 
N 

1 

= jV2 


EE djipfj I Var m (ej)+ ^2 d 2 jVar m (£j 
ieS r \jeS m ) j^Sm 

2 




= 


ElE^r + E«: 


ieSr \j€Sr, 


j£Sn 


N 2 


N 2 * 


E E d l4 t]2 +E E E ‘WifVr 1 + E + 


i(zSr j £ Sy f 


i&Sr j&Smi&Sm 
t+3 


j&Sr, 


E <? E dT + E E <*(* E dfd? 1 + E <t 2 

i€S m ieS r j&Sml&Sm i&Sr j&Sm 

1+3 


= o 


n 


where the last equality follows from assumption (A2), from assumption (A7), and 


from 4 fcl = O (£)• 


□ 


Lemma 4. Suppose that assumptions (A2) to (A7) hold. Then converges in 

probability to 0. 


Proof. Assumption (A4) implies that (see proof of Property 1) 


E m ,E j ( — I = 0. 


N 


(25) 


Then, by assumption (A3), we get 

(Y,-Y\ 


E. 


mpql 


— E p E (7 E m E/ 


Yj - Y 


= 0. 


(26) 


N J y N 

Moreover, from Lemma 2 and Lemma 3, we have 

VarmE/ (^T“) + EmVar/ (^T“) = °(l 

Assumption (A3), Equation (25) and Equation (26) together imply 


Var mpq i | n j — Y&iCpq m i ( ^ j ~ ^ f 


By Bienayme-Chebychev inequality, we conclude that Yi—1 converges in probability 


to 0. 


□ 
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Proof of the Proposition. 

The conclusion follows directly from equality 

Yj-Y %-Y ,Y-Y 
N ~ N + N ’ 

Lemma 1, and Lemma 4. □ 
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